Gittins and Whittle's Index
Bandit Problem
Markov Bandit Process
- MDP on a countable state space $E$, where $x(t) \in E$ is the state of the bandit at the discrete decision time $t \in \{0, 1, 2, \dots\}$.
- Controls $u(t) \in \{0, 1\}$ applied at decision time $t$:
  - $u(t) = 0$ freezes the process and gives no reward;
  - $u(t) = 1$ continues the process and gives instantaneous reward $r(x(t))$.
- $r(\cdot)$ is the bounded reward function.
- $\beta \in (0, 1)$ is the discount factor.
Simple Family of Alternative Bandit Processes
- $N$ Markov Bandit Processes with state spaces $E_1, \dots, E_N$; only one bandit is "served" at each decision time, and the states of the others do not change.
- Control $u_i(t) = 1$ is applied to a single bandit $i$ at each decision time $t$.
- All other bandits remain in the same state.
- Other Notations
  - $i(t)$: sequence of selected bandits;
  - $x_{i(t)}(t)$: state of the selected bandit at each decision time $t$;
  - $P_i(x, y)$: transition probability of bandit $i$ from state $x$ to state $y$ when played.
Objective Function
- Problem: sequentially allocate effort between the $N$ different processes to maximize the infinite-horizon expected discounted sum of rewards:
$$\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \, r_{i(t)}\big(x_{i(t)}(t)\big) \right]$$
- At time $t$, we know the state $x_i(t)$, the transition probabilities $P_i$, the discount factor $\beta$, and the reward function $r_i$ for each bandit (see the simulation sketch below).
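As an illustrative sketch (not from the source), this setup can be simulated with finite-state bandits given by a transition matrix `P` and a reward vector `r`; the names `MarkovBandit`, `run_policy`, and `choose` are hypothetical:

```python
import numpy as np

class MarkovBandit:
    """Hypothetical finite-state Markov bandit: transition matrix P, reward vector r."""
    def __init__(self, P, r, rng):
        self.P = np.asarray(P, dtype=float)  # P[x, y]: prob. of x -> y when played
        self.r = np.asarray(r, dtype=float)  # r[x]: instantaneous reward in state x
        self.state = 0
        self.rng = rng

    def play(self):
        """u = 1: collect r(x(t)) and transition; u = 0 would leave the state frozen."""
        reward = self.r[self.state]
        self.state = self.rng.choice(len(self.r), p=self.P[self.state])
        return reward

def run_policy(bandits, choose, beta=0.9, horizon=200):
    """Truncated estimate of the discounted reward of a policy
    `choose(states) -> index of the bandit to play`; all other bandits stay frozen."""
    total = 0.0
    for t in range(horizon):
        i = choose([b.state for b in bandits])
        total += beta**t * bandits[i].play()
    return total
```

Any allocation rule plugs in through `choose`; with $\beta < 1$ the truncation error of the finite horizon decays geometrically.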
Examples
Gittins Index
Multi Armed Bandit Problem (open problem for almost 40 years)
Index Policy
An index policy computes an index $G_i(x_i)$ for each bandit and, at every decision time, serves the bandit whose current index is highest. Notice that the index depends only on the parameters associated with a single bandit. But how should such a function be designed?
The optimal policy for this problem is an index policy.
Derivation of the Index – Single bandit with charge
- Consider a single bandit $i$ with a “playing charge” of $\lambda$: we pay $\lambda$ each time we choose $u_i(t) = 1$ at time $t$.
- Then we get the Optimal Reward:
$$V_i(x_i, \lambda) = \max_{\tau > 0} \; \mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^t \big( r_i(x_i(t)) - \lambda \big) \,\middle|\, x_i(0) = x_i \right]$$
- For every state $x_i$ there is a charge $\lambda$ such that there is a null reward for playing: $V_i(x_i, \lambda) = 0$.
- This equation has a single root, which is the Gittins Index $G_i(x_i)$, given by $V_i\big(x_i, G_i(x_i)\big) = 0$.
- This $G_i(x_i)$ is called the fair charge in state $x_i$: the charge that makes it equally desirable to play and to stop (a numerical sketch follows).
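Numerically, the fair charge of a finite-state bandit can be approximated by solving the optimal stopping problem with value iteration and bisecting on $\lambda$; this is a sketch under that finite-state assumption, with hypothetical helper names:

```python
import numpy as np

def stopping_value(P, r, lam, beta=0.9, iters=500):
    """V(x) = max(0, r(x) - lam + beta * sum_y P[x, y] * V(y)):
    value of optimally playing-then-stopping under playing charge lam."""
    P, r = np.asarray(P, dtype=float), np.asarray(r, dtype=float)
    V = np.zeros(len(r))
    for _ in range(iters):
        V = np.maximum(0.0, r - lam + beta * P @ V)
    return V

def gittins_index(P, r, x, beta=0.9, tol=1e-6):
    """Bisect on the charge: G(x) is the root of V(x, lam) = 0."""
    r = np.asarray(r, dtype=float)
    lo, hi = float(r.min()), float(r.max())  # the fair charge lies in the reward range
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if stopping_value(P, r, lam, beta)[x] > 0.0:
            lo = lam  # playing is still profitable: the fair charge is higher
        else:
            hi = lam  # null reward for playing: the fair charge is at most lam
    return 0.5 * (lo + hi)
```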
Gittins Index
Going back to the Simple Family of Alternative Bandit Processes, the fair charge can be written as
$$G_i(x_i) = \max_{\tau > 0} \; \frac{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^t \, r_i(x_i(t)) \,\middle|\, x_i(0) = x_i \right]}{\mathbb{E}\left[ \sum_{t=0}^{\tau - 1} \beta^t \,\middle|\, x_i(0) = x_i \right]}$$
where $\tau$ is the stopping time.
The numerator is the discounted REWARD up to time $\tau$; the denominator is the discounted TIME up to time $\tau$.
- GITTINS INDEX POLICY: choose the bandit with the highest $G_i(x_i(t))$ at every decision time $t$.
- “greatest per period rent that one would be willing to pay for ownership of the rewards arising from the bandit as it is continued for one or more periods.”
Intuition: since the index is the cost one pays to exactly break even (the fair charge), the larger the charge one is willing to pay for a bandit, the higher the reward it can be expected to yield.
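Combining the sketches above, a Gittins policy for `run_policy` could look as follows (illustrative, and deliberately naive: the index is recomputed at every step instead of being precomputed per state):

```python
def gittins_choose(bandits, beta=0.9):
    """Index policy: play the bandit whose current state has the highest
    Gittins index (in practice one would cache G_i(x) for every state)."""
    def choose(states):
        return max(range(len(bandits)),
                   key=lambda i: gittins_index(bandits[i].P, bandits[i].r,
                                               states[i], beta))
    return choose
```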
- Gittins Index Policy is optimal.
【Main ideas in the proof】
- We always choose the bandit with the largest current reward density.
- There is no “opportunity cost”, since the other bandits are frozen. (This breaks down when bandits are restless.)
This proof is instructive because:
- it provides insight into why the Gittins Index Policy is optimal;
- it provides insight into why it is NOT optimal in the restless case.
- Necessary Conditions for Gittins
- Control space is finite
- Infinite Horizon
- Constant exponential discounting
- Single processor/server
More details: @GittinsIndexMultiarmed
Whittle Index
Restless Multi Armed Bandit Problem
- Whittle extends the notion of index to restless bandits.
- At each time $t$, exactly $M$ out of the $N$ bandits are given the action $u_i(t) = 1$.
- Action $u_i(t) = 0$ no longer freezes the bandit (passive bandits may change state and accrue reward).
- At each time $t$, the activation constraint $\sum_{i=1}^{N} u_i(t) = M$ must hold.
Three Optimization Problems
- Original: maximize the expected discounted reward subject to the hard activation constraint $\sum_{i=1}^{N} u_i(t) = M$ at every decision time $t$.
- Relaxed: the same problem with the activation constraint relaxed to hold only in discounted expectation: $\mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \sum_{i=1}^{N} u_i(t) \right] = \frac{M}{1 - \beta}$.
- Lagrange: the Lagrange Dual Function is given by:
$$L(\lambda) = \max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \sum_{i=1}^{N} \Big( r_i\big(x_i(t), u_i(t)\big) - \lambda \, u_i(t) \Big) \right] + \frac{\lambda M}{1 - \beta}$$
Notice that we can decouple this problem into $N$ independent single-bandit problems and neglect the last term (a constant).
【Decoupled Problem】(Similar to Gittins) For each bandit $i$ under playing charge $\lambda$:
$$V_i(x_i, \lambda) = \max_{\pi_i} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t \Big( r_i\big(x_i(t), u_i(t)\big) - \lambda \, u_i(t) \Big) \,\middle|\, x_i(0) = x_i \right]$$
- Solution to the decoupled problem
  - The optimal policy for the Decoupled Problem may NOT be a stopping rule. In general, the optimal policy divides the state space into two subsets.
  - Let $D_i(\lambda)$ be the set of ALL states for which it is optimal to idle when the playing charge is $\lambda$.
  - The set $D_i(\lambda)$ is characterized by the solution of the Decoupled Problem.
  - Optimal Policy: play if $x_i \notin D_i(\lambda)$; stop otherwise (see the sketch below).
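A sketch of solving the Decoupled Problem for a finite-state restless bandit by value iteration, assuming hypothetical per-action transition matrices `P0`, `P1` and reward vectors `r0`, `r1`:

```python
import numpy as np

def decoupled_solve(P0, P1, r0, r1, lam, beta=0.9, iters=500):
    """Value iteration for one restless bandit under playing charge lam:
    V(x) = max( r0(x)       + beta * P0 V,    # idle (u = 0)
                r1(x) - lam + beta * P1 V )   # play (u = 1), paying the charge
    Returns V and the passive set D(lam) = {x : idling is optimal}."""
    P0, P1 = np.asarray(P0, dtype=float), np.asarray(P1, dtype=float)
    r0, r1 = np.asarray(r0, dtype=float), np.asarray(r1, dtype=float)
    V = np.zeros(len(r0))
    for _ in range(iters):
        q_idle = r0 + beta * P0 @ V
        q_play = r1 - lam + beta * P1 @ V
        V = np.maximum(q_idle, q_play)
    return V, q_idle >= q_play  # D(lam): states where it is optimal to idle
```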
Indexability
The Decoupled Problem associated with bandit $i$ is indexable if the passive set $D_i(\lambda)$ increases monotonically from $\emptyset$ to the whole state space $E_i$ as the charge $\lambda$ increases from $-\infty$ to $+\infty$.
This means that if a bandit is rested under charge $\lambda$, it is also rested under any charge $\lambda' > \lambda$.
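Since indexability is usually hard to establish analytically, one can at least sanity-check it numerically on a grid of charges; a heuristic sketch reusing `decoupled_solve` from above:

```python
import numpy as np

def check_indexability(P0, P1, r0, r1, lams, beta=0.9):
    """Check that the passive set D(lam) only grows along an
    increasing grid of charges `lams` (necessary, not sufficient)."""
    prev = None
    for lam in lams:
        _, passive = decoupled_solve(P0, P1, r0, r1, lam, beta)
        if prev is not None and np.any(prev & ~passive):
            return False  # a state left D(lam) as the charge grew
        prev = passive
    return True
```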
WI policy
Consider the Decoupled Problem and denote by $W_i(x_i)$ the Whittle Index of bandit $i$ in state $x_i$: the charge that makes playing and idling equally attractive, i.e. $W_i(x_i) = \inf\{\lambda : x_i \in D_i(\lambda)\}$.
Taking the activation constraint of the Original Problem into account, we obtain the following WI policy:
【Whittle Index Policy】 At every decision time $t$, activate the $M$ bandits with the highest Whittle Indices $W_i(x_i(t))$.
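Under indexability the passive set is monotone in $\lambda$, so bisection on the charge recovers the Whittle Index; a sketch reusing `decoupled_solve`, with a deliberately conservative bracket:

```python
import numpy as np

def whittle_index(P0, P1, r0, r1, x, beta=0.9, tol=1e-6):
    """Smallest charge lam for which idling is optimal in state x
    (bisection is valid when the problem is indexable)."""
    r0, r1 = np.asarray(r0, dtype=float), np.asarray(r1, dtype=float)
    span = max(r0.max(), r1.max()) - min(r0.min(), r1.min())
    lo, hi = -span / (1 - beta), span / (1 - beta)  # conservative bracket
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        _, passive = decoupled_solve(P0, P1, r0, r1, lam, beta)
        if passive[x]:
            hi = lam  # already idle: the fair charge is at most lam
        else:
            lo = lam  # still playing: the fair charge is above lam
    return 0.5 * (lo + hi)

def whittle_policy(indices, M):
    """Activate the M bandits with the highest current Whittle Indices."""
    return np.argsort(indices)[-M:]
```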
- The Index Policy is a low-complexity heuristic.
- Challenges
  - The Index Policy is only defined for problems that are indexable, a condition that is often difficult to establish.
  - Moreover, it is often hard to find a closed-form expression for $W_i(x_i)$.
- If our RMAB problem is actually a MAB (passive bandits are frozen and earn nothing), then Whittle $=$ Gittins. Thus, in this case, the Whittle Index Policy is optimal.